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The increase with time of computer resources devoted to simulations of full QCD 
is spectacular. Yet the reduction of systematic errors is comparatively slow. This 
,,^J ' is due to the algorithmic complexity of the problem. I review, in elementary terms, 

Qj , the origin of this complexity, and estimate it for 3 exact fermion algorithms. 

oo . 

There is a qualitative difference between quenched QCD, which consid- 
ers the dynamics of (bosonic) gauge fields, and full QCD, which includes the 
^ , dynamics of (fermionic) quarks, and thereby the physical effects of virtual 

^Z^ ' quark pair creation. The former has only short-range interactions; the lat- 

ter has long-range interactions among the gauge fields {U}. How can this 
fv^ ' happen, since fermionic fields "0 interact with each other locally, with ac- 

f^ , tion density iplpllU}) + m)'4> ? The reason is simple. Fermions anti-commute 

t^^ ' {ip{x)ip{y) — —'4){y)4'{x)), so that they cannot be simulated directly. The stan- 

^^ \ dard procedure consists in integrating them out of the partition function 



p^25^e"^->.'^^"^^^+"^'^^^^ 



The exponential can be expanded in a power series. Because of anti-commutation, 
the only surviving terms will be / dipd'4'ip{x)'ip{x) = 1, and the result of the in- 
tegration will be det{p{{U})+m) for each quark flavor, or det^'f {P{{U}) + m) 
for Uf flavors degenerate in mass. This determinant cannot be factorized into 

5J] ' local pieces, hence the long-range interactions induced on the gauge fields. 

C^ , 

1 Scaling limit 

In the presence of fermions, QCD possesses 2 mass (or length) scales: the string 
tension y/a ~ MOMeV of the pure gauge theory, and the quark mass m,. Both 
should be much smaller than the inverse lattice spacing a~^, to ensure small 
discretization errors. In addition the physical box {La)'^ represented by the 
lattice should be large enough to contain the interesting physics. Naively, 
these requirements would translate into 

a^/a <C 1 <C Lay^ (a) , . 

aruq ^ 1 <C Laruq (b) ^ ' 



However because quarks are confined, the largest-size object to be contained 
in the box is not a free quark, but a qq meson. For heavy quarks, its mass is 
indeed ex rriq. But for Ught quarks, the pion mass obeys the PC AC relation 

ml ^ Bniq (2) 

So confinement helps us here, by replacing 0(b) by 

amriq ^ 1 <C La\/By/mq (3) 

Therefore, one needs to distinguish 2 cases: 

• heavy quarks {niq ^ y/a) : constraints Wa) and 0(b) can be summarized 
by 

amq < 1 < Lay/a (4) 

• light quarks {mq ^ y/a) : then yj(a) and (|3|) are equivalent to 

a^/a < 1 < LaVBy^ (5) 

We are ultimately interested in simulating the light quark case, which corre- 
sponds to the real world, so we will try to estimate the complexity of various 
fermion algorithms in that regime. Note however that present-day simulations 
of full QCD are only probing the heavy-quark domain: the crossover to light 
quarks, which corresponds roughly to the strange quark mass, to the opening 
of p ^ TTTT decay, and to dominant sea-quark effects, has not been reached yet. 
In any case, the strategy for a light-quark simulation should be: 

(1) choose the lattice spacing so that the left-hand side of inequality (||) is 
satisfied; 

— 1/2 

(2) as the quark mass is decreased, increase L oc iriq to preserve the right- 
hand side of the inequality. 

1 will now compare the behavior of 3 exact algorithms as ruq — > 0. 

2 The simplest algorithm: Hnk-by-link MetropoUs 

Consider a local change in the gauge field (say one link only) . This will induce 
a local change A in the p part of the Dirac matrix D =P + m. The elements 
of matrix A are all zero, save for, say, l^xo,xa+li. and l^xo+ij..xai keeping only 
spatial indices for clarity. Consider then a Metropolis update. The acceptance 
probability will be 

P.,..™„a,(^=M£±^)"', 



So what is needed is 

Thus only 4 elements of the inverse matrix D~^ are needed. A straightforward 
way to compute them is to solve the linear systems 

Dzi,2 = 61,2 (8) 

where 61,2(2^) = Sx.xo ^^^d 6x,xo+fi respectively, using an iterative solver, to 
sufficient accuracy. The solution vector z represents a whole column of -D^^, 
from which the useful elements can be extractedp 

2.1 Cost analysis 

Using this algorithm, what is the cost of generating an independent configura- 
tion? 

I make the plausible assumption that, as the quark mass goes to zero, the num- 
ber of iterations needed to solve (H) grows as l/niq. This assumption certainly 
hoick for staggered fermions, and appears to hold also for Wilson fermions (see 
eg.B). 

The work required per link update is then proportional to Vm ^, or V^m ^ 
per sweep. Since the relevant correlation length as discussed in H^ is m^^ ^ 
niq ' B^^/^, the number of sweeps necessary to decorrelate the gauge field 
will be {niq ' B^^^^Y, or ~ w~^, since z « 2 for a Metropolis algorithm. 
Altogether, the work per independent configuration is 

V'm-^ (9) 

or, since the lattice size L must scale as rriq (see eq.||) 

L^^ (10) 

or a^^^-f| The prefactor can be considerably reduced in clever variants of this 
methods But the algorithmic complexity remains daunting. It is even possible 
that (|lO|) actually underestimates it. The statement that the number of sweeps 
per independent configuration grows as m~^ relies on the assumption that the 

"For a non-confining theory in d dimensions, the correlation length would become rriq , 
giving a cost per independ 
giving a complexity L^d+a 



giving a cost per independent configuration V^m„ ; the lattice size would scale as m„ , 



change in each Hnk update is not hniited by further constraints. On the other 
hand a naive inspection of the determinant (M) indicates that A might scale 
as ruq, to compensate the m~^ divergence of D~^. If the step size is ~ niq, 
the number of steps needed to explore the whole gauge group and obtain a 
decorrelated configuration would be ~ JtiI"^, instead of m~^ as stated above. 
In that case the complexity would be 

V'^m-^ or L^"* (11) 

Whether in practice the step size goes like ^to^ or raq, it is interesting 
that this algorithm, which in principle admits Monte Carlo steps of any size, 
automatically restricts them as the quark mass decreases. This restriction 
creates further problems, of ergodicity and of slowing-down, in the presence of 
an energy barrier like a zero-mode of the Dirac matrix separating successive 
topological sectors. These problems are entirely neglected here: a comparative 
study of fermionic algorithms with regard to exploring topological sectors is 
still awaiting. 



3 Hybrid Monte Carlo 

The magic of Hybrid Monte Carlo (HMQa) consists in updating all links si- 
multaneously by a very small step, which allows a simplifying linearization of 
the problem. The error caused by the finiteness of the step size is periodically 
corrected by a Metropolis test. HMC is the standard fermioniji|algorithm. It 
has been reviewed several times, and its complexity conjecturecHfl and studied 
numericallyLl 

HMC first introduces one species of auxiliary bosonic fields 0, through the 
Gaussian integral 

The squaring of [p -t- m) guarantees the convergence of the Gaussian integral 
(since it may happen, for small quark masses, that the lattice Dirac operator 
develop eigenvalues with a negative real part). It also makes it very cheap 
to update the field by a heatbath: <— [p + m)^rj^ where 77 is a complex 
Gaussian vector. 

Heatbath updates of will alternate with updates of the gauge links. This 
is accomplished by introducing fictitious momenta Px.^ conjugate to the gauge 
fields Ax^^ {Ux,^ — exp{iAx^p,)), and a fictitious Hamiltonian 



H = Y^^-^ + Sg + 4>H{P + m)\p + m)]- V (13) 

where Sq is the gauge action. The last 2 terms form the potential energy V of 
the gauge fields. After a heatbath update of 0, the momenta pj;^^ are initialized, 
also by heatbath p^ <— rj. Then a discrete integration of Hamiltonian evolution 
is pursued along a "trajectory" of length 0(1) in fictitious time, using generally 
the leapfrog integrator: 

A(i + ^) = ^(i) + yp (14) 

p{t + 5T) = p{t)-6TVV (15) 

Ait + Sr) = A(i+^) + ^p (16) 

The algorithm, at this point, is exact up to integration errors, which cause in 
particular violations AE of the conservation of the total Hamiltonian energy. 
These deviations are compensated for by a Metropolis test at the end of the 
trajectory, with acceptance Pace = min{l^e~^^). Thus HMC can be viewed 
as an elaborate Metropolis scheme, where the candidate configuration {Unew\ 
is obtained from {Uoid\ not by a random step, but by a carefully guided step. 
Detailed balance is satisfied because the leapfrog integrator is symplectic and 
reversible, guaranteeing the "evenness" of the distribution of proposed changes. 

3.1 Cost analysis 

• The optimal step size is the result of a compromise between fast Hamilto- 
nian integration and high Metropolis acceptance. Very briefly, the RMS 
energy violation per step is 6E ~ \/Vm~^5T^ . The ^/V comes from the 
uncorrelated contributions over a large volume. The m~^ST^ conies from 
the leading error in a reversible (second-order) integrator. Over a trajec- 
tory of 1/St steps, the RMS energy violation is then AE ~ -\/Fto~^<5t^, 
which must be k&pt constant to preserve a constant acceptance. Hence 

• The cost per step comes overwhelmingly from the calculation of the 
fermionic force, requiring 0{m'Z^) iterations of a linear solver. 

The total cost per trajectory of length 1 is thus ^ V^^^mq . 
Different scenarios have been proposed as to the number of trajectories neces- 
sary for decorrelation. 



(a) Trajectories of length 0(1) are sufficient to guarantee a dynamical critical 



exponent z = 1. The correlation length is ^ 


' Vflq 


, so that the total cost is 


V^^^m^^ or 


L" 


(17) 


-1/2 j- 

smce mq (x L. 




n 



(b) If one increases the trajectory length like the correlation length, then z = OS 
The step size becomes br ^ V^^l^raq , and the cost is 

T/5/4^-13/4 or L"-5 (18) 

(c) With trajectory of length 0(1), the critical exponent z is 2. Then the cost 

^5/4^-7/2 or Ll2 (19) 

In my opinion, scenario (a) is too optimistiG..(a similar scenario for the quenched 
theory has been numerically proven wron^) . I favor (c) . 

HMC is straightforward to program, even with improved, less local gauge 
or fermionic actions. It directly benefits from recent progress in linear solvers. 
It can be further accelerated, by trading the Dirac matrix D for another one 
of same determinant, but giving a smaller force. The replacemeiit-of D by an 
LDU combination allows a large (0(5)) increase of the step sizeE2l 
One potential problem with HMC on large lattices comes from round-off er- 
rors. The Metropolis acceptance depends on the energy difference AE, which 
is a small 0(1) difference of 2 large numbers 0{V). Special care must be 
taken to calculai.e_A_E accurately enough that violations of reversibility re- 
main negligibleEjii3 The next algorithm circumvents this problem. 

4 The multiboson approach 

This_approach, originally proposed by Liischeii3, can be summarized as follows 
(seellil for a detailed presentation): 

• Choose a Chebyshev-like polynomial Pn{x) approximating l/x in a do- 
main of the complex plane which includes the eigenvalue spectrum of the 
Dirac operator D. Pn can be factorized as 

Pn{x) = constant Wl^^{x - Zk) (20) 

• It follows, by going to an eigenbasis of D, that 

detDK. constant li^^^det^^ {D ~ Zkl) (21) 
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• 



• 



or 

I detD 1 2 « constant U^^ ^ dei" ^ (D - z^ 1 ) ^f (£> - z^ 1 ) (22) 

Each factor in the right-hand side can be replaced by a Gaussian integral 
over an auxiliary field 0^ . The effective action is then 



Seff =PSg+Y^ (f>iiD - Zkl)HD - zk\)(t)k (23) 

fc=i 

Update the f/'s and the 0fc's by a local MC algorithm (typically a mix- 
ture of over-relaxation and heatbath) for a reversible sequence of steps 
("trajectory"). 

Correct the error \detDPn{D)\^ in the measure by a cheap Metropolis 
test at the end of each trajectory. 



In addition to avoiding the accuracy problems of HMC, this method relies 
on manifestly finite update steps, so there is hope that it will be more efficient 
in overcoming energy barriers. Moreover, it generalizes readily tOjjajther func- 
tions oidetD, as needed for simulations of odd numbers of flavors^ or SUSY 
theoriesE-3 The price to pay for these potential advantages is a large increase in 
memory (the number of auxiliary fields needed is about the same as the num- 
ber of iterations of the linear solver in HMC) , and a programming complexity 
which increases very rapidly with the range of the operator D. 

4-.1 Cost analysis 

• The work per step grows like n, the number of bosonic fields. 

• During a link update, the bosonic fields are kept frozen. They limit the 
size of the update step to ex fiTM.^, so that the autocorrelation time of 
the [/'s grows linearly with nll3'llil 

• The optimal choice for n is therefore a compromise between fast evolution 
and low Metropolis acceptance. The approximation error |AP„(A) — 1| for 
the Chebyshev-like polynomials considered decreases exponentially with 
n, so that it will be bounded by 0{e~'^""^'') for some constant c. To keep 
the Metropolis acceptance constant, one must preserve Log{detDPn{D)) ^ 
Y^-cnmc_ Therefore, n ex ni~^LogV. 

• One expects additional slowing down m^^ from the local MC dynamics 
of fields (j)k ■ 



Altogether these factors yield a cost per independent configuration V{LogV)^m^- 
Since the (pk dynamics are local, one^can expect at best z = 1, which is con- 
sistent with numerical observationscj A more conservative scenario would be 
z = 2. Clearly, a cluster algorithm with z < 1 would make this algorithm very 
competitive. In any case the cost is, for z — 1 



or, for z — 2 



ViLogVym^^ or L''\LogLY 



V{LogVfm~'^ or L^^LogL)^ 



(24) 
(25) 



5 Summary 

Table 1 summarizes our analysis. It is remarkable that all algorithms ulti- 
mately give an L^^ dependence as the quark mass approaches 0. The reason 
for this probably is that all 3 algorithms are equally incapable of accelerating 
the evolution of the relevant approximate aero-modes of the Dirac operator. 
The implementation of the Cornell progranEil (Fourier acceleration of MC dy- 
namics, after Fourier-accelerated gauge-fixing) could improve this picture. On 
the other hand, the relevant dynamics as niq — > is most likely the motion 
through topological sectors, which our analysis ignores. 

Topological sectors are separated by narrow but high energy barriers. The 
dynamics of barrier-crossing is likely to be sensitive to the size of an elementary 
update step, which is hsted in Table 2. Lack oLergodicity appears especially 
dangerous with HMC, as noted some time agota 



Table 1: Complexity of fermionic algorithms - conservative scenario. 





Link-by-link 


HMC 


Multiboson 


TTiq fixed 


V'm~^ 


V^/^m;'^'' 


V{LogV)^m-^ 


T -1/2 
L ex Ulq 


Li2 


Li2 


Li2 



Table 2: Size of an elementary update step. 





Link-by-link 


HMC 


Multiboson 


mq fixed 


1/2 

niq' 


V-^/^mf' 


(io5l/)"i/2^V2 


T -1/2 

L (xruq ' 


L-1 


L-2 


L-i(Logy)-V2 



For current simulations, with not-so-light quarks in not-so-large volumes, 
the first line of Table 1, which separates the complexity in volume and in quark 
mass, may be the more relevant. The 3 algorithms reviewed, presented in 
chronological order, progressively trade an increased dependence on the quark 
mass for a decreased dependence on the volume. This represents progress for 
fixed quark mass and large volumes (meaning m-^La S> 1): in that case, the 
multiboson method should be the most efhcient. 

This has in fact been verified in the cuirent study of 1-flavor QCD thermody- 
namics, with heavy dynamical quarkslij The dynamics of the Polyakov loop 
were ~ 10 times faster with the multiboson method than with HMC. 
On the other hand, for very light quarkS|-(in a small volume), HMC and the 
multiboson methods perform equivalentlyu3 

Finally, I stress that these are costs per independent configuration. The 
number of independent configurations required to preserve the statistical signal- 
to-noise ratio may increase as mq decreases, depending on the observable. 

6 Conclusion 

Lattice QCD simulations started in 1980, at the initiative of one of this meet- 
ing's participantsE3 Progress in algorithms has since been as impressive as in 
hardware. 

On one hand, lattice QCD is a homogeneous problem on a regular grid: it be- 
longs to the class of "embarrassingly parallel" problems, for which the simplest, 
cheapest SIMD parallel computer is sufficient. Current simulation projects call 
for 0(1 Gigaword) of data, 0{a. few months) of computer time on 0{a. few 100) 
GFlops machine. The resources involved justify continued research on algo- 
rithms. 

On the other hand, the current situation is quite different in the quenched and 
the full QCD cases. Quenched QCD is a "mature" field: algorithms are stag- 
nant; a continuum extrapolation of MC results, in large volumes, is reliable 
at the 10% level. Full QCD still is at an early stage: algorithms are still be- 
ing explored and improved; 8 years elapsed before a practical exact algorithm 
(HMC) was devised; 8 more years passed by before a competitive alternative 
(multiboson) was found; more progress will undoubtedly come. The cost of 
current algorithms is still so high that simulations are generally squeezed in a 
dilemma: either the lattice is too coarse, or the physical volume is too small. 
The interesting regime of light dynamical quarks is barely starting to be ex- 
plored. This slower progress is a consequence of the tremendous complexity 
~ L^"^ of full QCD (a similar analysis for quenched QCD yields a complexity 
~L6 "only"). 
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The relevance of the complexity analysis presented in this review is limited. 
Table 1 only shows asymptotic behaviors (large volume, small quark mass) —La 
any case, prefactors must be determined through numerical experimentaCJ 
But one outcome should be clear: an improved action, which would reduce 
discretization errors such that the lattice spacing can be doubled, offers a 
potential savings of 2^^ ~ 4000 in computer time. 
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